Gradient Descent: Second Order Momentum and Saturating Error
Author
Abstract
Batch gradient descent, Δw(t) = −η dE/dw(t), converges to a minimum of quadratic form with a time constant no better than (1/4) λmax/λmin, where λmin and λmax are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term, Δw(t) = −η dE/dw(t) + α Δw(t−1), improves this to (1/4) √(λmax/λmin), although only in the batch case. Here we show that second-order momentum, Δw(t) = −η dE/dw(t) + α Δw(t−1) + β Δw(t−2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a non-quadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.
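As a rough illustration (not code from the paper), the three update rules in the abstract can be sketched on a toy quadratic whose Hessian eigenvalues 1 and 10 play the roles of λmin and λmax; the values of eta, alpha, and beta below are illustrative choices, not the paper's.

```python
# Toy quadratic E(w) = 0.5 * sum_i lam_i * w_i^2, so dE/dw_i = lam_i * w_i.
# Each coordinate corresponds to one eigen-direction of the Hessian.
def descend(alpha=0.0, beta=0.0, eta=0.05, steps=500):
    ws = [1.0, 1.0]            # one weight per eigen-direction
    lams = [1.0, 10.0]         # Hessian eigenvalues (lambda_min, lambda_max)
    prev = [0.0, 0.0]          # Delta w(t-1), for the momentum term
    prev2 = [0.0, 0.0]         # Delta w(t-2), for second-order momentum
    for _ in range(steps):
        for i, lam in enumerate(lams):
            grad = lam * ws[i]                              # dE/dw
            dw = -eta * grad + alpha * prev[i] + beta * prev2[i]
            prev2[i], prev[i] = prev[i], dw
            ws[i] += dw
    return max(abs(w) for w in ws)    # distance from the minimum at 0

plain = descend()                     # plain batch gradient descent
momentum = descend(alpha=0.9)         # with first-order momentum
second = descend(alpha=0.9, beta=0.0) # beta adds the Delta w(t-2) term
```

With alpha = 0 and beta = 0 this reduces to plain gradient descent; the abstract's claim is that a nonzero beta cannot improve the convergence time constant beyond what first-order momentum already achieves.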
Similar References
Handwritten Character Recognition using Modified Gradient Descent Technique of Neural Networks and Representation of Conjugate Descent for Training Patterns
The purpose of this study is to analyze the performance of the backpropagation algorithm with changing training patterns and a second momentum term in feedforward neural networks. This analysis is conducted on 250 different words of three small letters from the English alphabet. These words are presented to two vertical segmentation programs, designed in MATLAB and based on portions (1...
To appear in the proceedings of NIPS*4, Morgan Kaufmann 1992. Gradient Descent: Second-Order Momentum and Saturating Error
Batch gradient descent, Δw(t) = −η dE/dw(t), converges to a minimum of quadratic form with a time constant no better than (1/4) λmax/λmin, where λmin and λmax are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term, Δw(t) = −η dE/dw(t) + α Δw(t−1), improves this to (1/4) √(λmax/λmin), although only in the batch case. Here we show that ...
An Investigation of the Gradient Descent Process in Neural Networks
Usually gradient descent is merely a way to find a minimum, abandoned if a more efficient technique is available. Here we investigate the detailed properties of the gradient descent process, and the related topics of how gradients can be computed, what the limitations on gradient descent are, and how the second-order information that governs the dynamics of gradient descent can be probed. To de...
Momentum and Optimal Stochastic Search
The rate of convergence of gradient descent algorithms, both batch and stochastic, can be improved by including in the weight update a "momentum" term proportional to the previous weight update. Several authors [1, 2] give conditions for convergence of the mean and covariance of the weight vector for momentum LMS with a constant learning rate. However, stochastic algorithms require that the learn...
On the importance of initialization and momentum in deep learning
Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train...